In this lab we analyse spatial data from a study investigating the spatial pattern in cancer risk in Greater Glasgow, Scotland, between 2001 and 2005. The study region is the Greater Glasgow and Clyde health board, which contains the city as well as the surrounding area. The health board is split up into \(n=271\) administrative units called Intermediate Geographies (IG), which have a median area of 124 hectares and a median population of 4,239. The disease data we will model are the numbers of new cancer cases diagnosed between 2001 and 2005 for each of the 271 intermediate geographies that comprise our study region. The expected numbers of cases are calculated by indirect standardisation, using age and sex adjusted rates for the whole of Scotland. A small number of covariates are available to describe the spatial variation in cancer risk across Greater Glasgow, and the data for this study come in the following two parts.
glasgowdata.csv - contains the response and covariate data: the Intermediate Geography code (IG), the observed (Y) and expected (E) numbers of cancer cases, and the covariates pm10, smoke and ethnic.
SG - a shapefile, comprising multiple files with the same name but different file extensions (e.g. .shp, .dbf, .shx), that contains the polygons making up the set of Intermediate Geographies (IG) for all of Scotland.
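The expected counts \(E_i\) mentioned above come from indirect standardisation. As a minimal sketch of the idea, with entirely made-up national rates and stratum populations (the real calculation uses Scotland-wide age and sex specific rates):

```r
## Made-up national age-sex specific rates (cases per person, over the period)
rates <- c(M.0.64 = 0.0020, M.65plus = 0.0500,
           F.0.64 = 0.0015, F.65plus = 0.0400)
## Made-up population of one hypothetical IG in the same strata
pop <- c(M.0.64 = 1800, M.65plus = 250,
         F.0.64 = 1900, F.65plus = 350)

## Expected count: apply the national rates to the local population structure
E <- sum(rates * pop)
E        # 32.95

## With Y observed cases the SMR is Y / E; SMR > 1 indicates elevated risk
Y <- 25
Y / E
```

Areas with older populations thus get larger expected counts, so the SMR measures risk relative to what the national rates predict for that area's demographic structure.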
To model areal data in R we use the CARBayes package (Lee (2013)). The package has a user guide that is provided with this lab handout.
Aim
In analysing these data our aims are to: (i) visualise the spatial pattern in cancer risk via the SMR; and (ii) model the data to quantify the effects of the covariates on cancer risk while allowing for spatial autocorrelation.
First read in the data using the commands:
#### Read in the data
dat <- read.csv(file="glasgowdata.csv")
head(dat)
which shows that the data contain the variables listed above. As discussed in lectures the exploratory measure of disease risk is the standardised morbidity ratio (SMR), which for area \(i\) is given by
\[\mbox{SMR}_i~=~\frac{Y_i}{E_i}.\]
Compute the SMR for these data and add it as an additional column to the data.frame dat using the following code.
dat$smr <- dat$Y / dat$E
The shapefiles can be read in using the commands:
#### Read in the shapefile
library(sp)
library(rgdal)
shape <- readOGR(dsn = "SG.shp")
Warning in OGRSpatialRef(dsn, layer, morphFromESRI = morphFromESRI, dumpSRS =
dumpSRS, : Discarded datum OSGB_1936 in CRS definition: +proj=tmerc +lat_0=49
+lon_0=-2 +k=0.9996012717 +x_0=400000 +y_0=-100000 +ellps=airy +units=m +no_defs
OGR data source with driver: ESRI Shapefile
Source: "/Users/duncanlee/OneDrive - University of Glasgow/teaching/Spatial statistics 2021/Computer labs/SG.shp", layer: "SG"
with 1235 features
It has 5 fields
This object shape is a SpatialPolygonsDataFrame object, which can be seen from
class(shape)
[1] "SpatialPolygonsDataFrame"
attr(,"package")
[1] "sp"
This object essentially combines the set of polygons representing the IGs with a limited amount of pre-loaded data. The data element can be viewed by:
head(shape@data)
The next step is to transform the coordinate reference system from metres (easting and northing) to degrees (longitude and latitude), which makes it compatible with the data visualisation later on. This is achieved via the code:
proj4string(shape)
Warning in proj4string(shape): CRS object has comment, which is lost in output
[1] "+proj=tmerc +lat_0=49 +lon_0=-2 +k=0.9996012717 +x_0=400000 +y_0=-100000 +ellps=airy +units=m +no_defs"
shape2 <- spTransform(shape, CRS("+proj=longlat +datum=WGS84 +no_defs"))
proj4string(shape2)
Warning in proj4string(shape2): CRS object has comment, which is lost in output
[1] "+proj=longlat +datum=WGS84 +no_defs"
Then use the following code to combine the data set and shapefiles together.
sp.dat <- merge(shape2, dat, all.x=FALSE, by.x="IZ_CODE", by.y="IG")
class(sp.dat)
[1] "SpatialPolygonsDataFrame"
attr(,"package")
[1] "sp"
The resulting object sp.dat is also a SpatialPolygonsDataFrame object. Note that if you look at the data and plot the object, you get the following results:
head(sp.dat@data)
plot(sp.dat)
This raises two points. First, the merged data frame now contains both the shapefile fields and the study data. Second, because all.x=FALSE was specified, the merged object retains only the 271 IGs in the study region rather than all 1235 in Scotland, which is why the plot shows only Greater Glasgow.
Once the data have been read into R the natural first step is to draw a map, and in this example the SMR is the natural variable to visualise. R has a number of different packages for drawing maps, including sp and ggplot2, and we illustrate the ggplot2 package here. First load the required packages using the code:
library(ggplot2)
library(rgeos)
library(maptools)
Before you can draw a map, ggplot2 requires you to turn the sp.dat SpatialPolygonsDataFrame object into a data.frame. This can be done using the following code:
sp.dat@data$id <- rownames(sp.dat@data)
temp1 <- fortify(sp.dat, region = "id")
sp.dat2 <- merge(temp1, sp.dat@data, by = "id")
Then a basic map of the SMR can be created using the following code:
ggplot(data = sp.dat2, aes(x=long, y=lat, group=group, fill = smr)) +
geom_polygon()
Here, aes() maps longitude and latitude to the plot axes, group=group ensures each polygon is drawn separately, and fill=smr colours each area by its SMR value, while geom_polygon() draws the filled polygons.
However, this map is unsatisfactory in a number of ways, and can be improved by adding additional commands to the ggplot() function separated by the + sign as shown below.
library(RColorBrewer)
ggplot(data = sp.dat2, aes(x=long, y=lat, group=group, fill = smr)) +
geom_polygon() +
xlab("Longitude") +
ylab("Latitude") +
labs(title = "SMR for cancer risk", fill = "SMR") +
theme(title = element_text(size=14)) +
scale_fill_gradientn(colors=brewer.pal(n=9, name="Reds"))
Here the new lines make the following changes: xlab() and ylab() label the axes; labs() adds a plot title and a legend title; theme() increases the title font size; and scale_fill_gradientn() replaces the default colour scale with the sequential "Reds" palette from the RColorBrewer package.
Finally, you can do an interactive data visualisation using the leaflet package as shown below, which overlays your data on an OpenStreetMap.
library(leaflet)
colours <- colorNumeric(palette = "OrRd", domain = sp.dat@data$smr, reverse=FALSE)
leaflet(data=sp.dat) %>%
addTiles() %>%
addPolygons(fillColor = ~colours(smr),
color="black", weight=1,
fillOpacity = 0.7) %>%
addLegend(pal = colours, values = sp.dat@data$smr,
opacity = 1, title="SMR") %>%
addScaleBar(position="bottomleft")
Note that this data visualisation tool works with the original SpatialPolygonsDataFrame object sp.dat.
When fitting a regression model it is the spatial autocorrelation in the residuals, after adjusting for any covariates, that should be checked. So the next step is to fit a simple covariate model and check its residuals for spatial autocorrelation. Given that the response data are counts, the following Poisson log-linear model is appropriate. Here the expected numbers of cases (\(E_i\)) are included as an offset term on the log scale (as the linear predictor is on the log scale).
\[\begin{aligned} Y_i&\sim \mbox{Poisson}(\mu_i)\\ \ln(\mu_i) &= \ln(E_i) + \beta_1 + \beta_2 pm10_{i} + \beta_3 smoke_i + \beta_4 ethnic_{i}.\nonumber \end{aligned}\]
This model is fitted and the results visualised using the following code:
form <- Y~offset(log(E))+pm10+smoke+ethnic
model1 <- glm(formula=form, family=poisson, data=dat)
summary(model1)
Call:
glm(formula = form, family = poisson, data = dat)
Deviance Residuals:
Min 1Q Median 3Q Max
-4.1688 -0.9560 -0.0885 0.8845 4.0877
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.5470147 0.0565970 -9.665 < 2e-16 ***
pm10 0.0227687 0.0035544 6.406 1.5e-10 ***
smoke 0.0082125 0.0006203 13.239 < 2e-16 ***
ethnic -0.0049302 0.0005853 -8.423 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 972.94 on 270 degrees of freedom
Residual deviance: 616.93 on 267 degrees of freedom
AIC: 2401.9
Number of Fisher Scoring iterations: 4
Moran’s I statistic for measuring the amount of spatial autocorrelation can be calculated using functionality from the spdep package. Specifically, the statistic is computed and a hypothesis test conducted (where the null hypothesis \(H_0\) is independence) using the moran.mc() function. However, first the spatial neighbourhood information is required in the form of a list-type object, which is created from the sp.dat object using the following code:
library(spdep)
W.nb <- poly2nb(sp.dat, row.names = rownames(sp.dat@data))
W.list <- nb2listw(W.nb, style = "B")
Then the presence of spatial autocorrelation can be checked using the following code:
moran.mc(x = residuals(model1), listw = W.list, nsim = 10000)
Monte-Carlo simulation of Moran I
data: residuals(model1)
weights: W.list
number of simulations + 1: 10001
statistic = 0.084654, observed rank = 9894, p-value = 0.0107
alternative hypothesis: greater
The small p-value (0.011) shows that spatial autocorrelation is present in the residuals, and hence it needs to be modelled.
We allow for the spatial autocorrelation by fitting the extended model:
\[\begin{aligned} Y_i&\sim \mbox{Poisson}(\mu_i)\\ \ln(\mu_i) &= \ln(E_i) + \beta_1 + \beta_2 pm10_{i} + \beta_3 smoke_i + \beta_4 ethnic_{i} + \phi_i.\nonumber \end{aligned}\]
The random effects \(\phi_i\) are modelled by the CAR model proposed by Leroux, Lei, and Breslow (2000), which is given by
\[\phi_{i}|\boldsymbol{\phi}_{-i}, \mathbf{W}\sim\mbox{N}\left(\frac{\rho\sum_{j=1}^{K}w_{ij}\phi_{j}}{\rho\sum_{j=1}^{K}w_{ij} + 1-\rho}, \frac{\tau^{2}}{\rho\sum_{j=1}^{K}w_{ij} + 1-\rho}\right),\]
where \(\rho\in[0,1]\) is a spatial dependence parameter: \(\rho=0\) corresponds to independent random effects, while \(\rho=1\) corresponds to the intrinsic CAR model for strong spatial dependence.
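The full conditionals above are equivalent to a joint zero-mean Gaussian distribution for \(\boldsymbol{\phi}\) with precision matrix \(\mathbf{Q}/\tau^2\), where \(\mathbf{Q} = \rho(\mbox{diag}(\mathbf{W}\mathbf{1}) - \mathbf{W}) + (1-\rho)\mathbf{I}\). A sketch on a toy 4-area "line" adjacency matrix (hypothetical, not the Glasgow W) confirms the correspondence:

```r
## Toy adjacency matrix: areas 1-2, 2-3 and 3-4 are neighbours
W <- matrix(0, 4, 4)
W[cbind(1:3, 2:4)] <- 1
W <- W + t(W)
rho <- 0.5

## Joint precision (up to the 1/tau^2 factor)
Q <- rho * (diag(rowSums(W)) - W) + (1 - rho) * diag(4)

## Conditional precision of phi_1: Q[1,1] = rho * sum_j w_1j + 1 - rho
Q[1, 1]             # 0.5 * 1 + 0.5 = 1

## Conditional mean weight on phi_2: -Q[1,2] / Q[1,1]
## = rho * w_12 / (rho * sum_j w_1j + 1 - rho), as in the formula above
-Q[1, 2] / Q[1, 1]  # 0.5
```

Working with the joint precision matrix makes clear why \(\rho=0\) yields independence (Q reduces to the identity) and \(\rho=1\) yields the intrinsic CAR model (Q reduces to the graph Laplacian, which is singular).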
This model can be fitted in a Bayesian setting using Markov chain Monte Carlo (MCMC) simulation using the S.CARleroux() function from the CARBayes package, which requires (at a minimum) the following arguments: a model formula, the data likelihood family, the data, the neighbourhood matrix \(\texttt{W}\), and the MCMC settings burnin and n.sample.
The only one of these we need to construct is \(\texttt{W}\), the adjacency or neighbourhood matrix, which can be constructed from the spatial object using the following code:
W <- nb2mat(W.nb, style = "B")
The model can be fitted using the following code, where the print() function prints a summary of the model to the screen. The verbose=FALSE argument stops the function updating the user on its progress, which is purely done to make this document look nice! I recommend setting verbose=TRUE (the default value) so you can see how long the function has left to run.
library(CARBayes)
model2 <- S.CARleroux(formula=form, family="poisson", data=sp.dat@data,
W=W, burnin=20000, n.sample=120000, thin=10, verbose=FALSE)
print(model2)
#################
#### Model fitted
#################
Likelihood model - Poisson (log link function)
Random effects model - Leroux CAR
Regression equation - Y ~ offset(log(E)) + pm10 + smoke + ethnic
Number of missing observations - 0
############
#### Results
############
Posterior quantities and DIC
Median 2.5% 97.5% n.effective Geweke.diag
(Intercept) -0.5478 -0.8064 -0.3088 880.4 1.8
pm10 0.0235 0.0088 0.0396 785.7 -1.7
smoke 0.0073 0.0052 0.0095 1288.5 0.4
ethnic -0.0047 -0.0066 -0.0028 1230.2 0.1
tau2 0.0215 0.0131 0.0335 2667.1 1.1
rho 0.3747 0.1000 0.7295 1706.5 0.7
DIC = 2209.068 p.d = 141.9742 LMPL = -1136.07
The output from the print() function is split into 2 sections. The first section Model fitted displays the model that has been fitted, which includes the choice of covariates, the data likelihood model and the random effects model. The second section presents the results, which includes both parameter summaries for key parameters and overall model fit criteria such as the Deviance Information Criterion (DIC, Spiegelhalter et al. (2002)) with the effective number of parameters (p.d). The summary table of the key model parameters (all parameters except the random effects \(\boldsymbol{\phi}\)) contains the following information: the posterior median and 95% credible interval for each parameter, the effective number of independent MCMC samples (n.effective), and the Geweke (1992) convergence diagnostic (Geweke.diag), a z-score that should lie roughly between -2 and 2 if the chain has converged.
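The DIC reported above trades off fit against complexity. A sketch of its definition, using made-up deviance values (illustrative only, not extracted from model2):

```r
## DIC = Dbar + p.d, where Dbar is the posterior mean of the deviance and
## p.d = Dbar - D(theta.bar) is the effective number of parameters,
## with D(theta.bar) the deviance evaluated at the posterior means.
Dbar <- 2067.1       # hypothetical posterior mean deviance
D.at.mean <- 1925.1  # hypothetical deviance at the posterior means

p.d <- Dbar - D.at.mean
DIC <- Dbar + p.d
c(DIC = DIC, p.d = p.d)
```

Note that p.d is typically far smaller than the actual number of parameters (here 271 random effects plus the regression and variance parameters), because the CAR prior smooths the random effects and so reduces their effective dimension.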
The fitted model object model2 is an R list object, which contains the following elements as shown via the summary function.
summary(model2)
Length Class Mode
summary.results 42 -none- numeric
samples 6 -none- list
fitted.values 271 -none- numeric
residuals 2 data.frame list
modelfit 6 -none- numeric
accept 4 -none- numeric
localised.structure 0 -none- NULL
formula 3 formula call
model 2 -none- character
X 1084 -none- numeric
The key elements in this list are as follows: summary.results is the parameter summary table printed above; samples contains the post burn-in and thinned MCMC samples for each set of parameters; fitted.values and residuals contain the posterior fitted values and the corresponding residuals; and modelfit contains the overall model fit criteria, including the DIC and p.d.
The convergence of the MCMC samples can be assessed by viewing traceplots of the samples for certain parameters. The set of samples for a given parameter have converged if they show no trend and random scatter above and below the average value. The samples are stored in the samples element of the R list object model2, which for this model has elements
summary(model2$samples)
Length Class Mode
beta 40000 mcmc numeric
phi 2710000 mcmc numeric
tau2 10000 mcmc numeric
rho 10000 mcmc numeric
fitted 2710000 mcmc numeric
Y 1 mcmc logical
which correspond to the different parameters in the model. For example, to plot the traceplot for the spatial dependence parameter \(\rho\) use the following code.
plot(model2$samples$rho)
The left plot is the traceplot, which shows no trend and random scatter, suggesting convergence, while the right plot shows a density estimate of the samples. Additionally, these samples show the estimated value of \(\rho\) is close to 0.4, suggesting the spatial dependence in these data after adjusting for the covariates is moderate.
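The n.effective and Geweke.diag columns in the printed summary can be computed from any mcmc object (the class used for the stored samples above) via the coda package. A minimal sketch on a toy chain of independent draws, not model2's samples:

```r
library(coda)

set.seed(1)
chain <- mcmc(rnorm(10000))  # toy chain of independent draws

## Effective number of independent samples: close to 10000 here because the
## draws are independent; autocorrelated MCMC chains give smaller values.
effectiveSize(chain)

## Geweke diagnostic: a z-score comparing the means of the early and late
## parts of the chain; values outside roughly (-2, 2) suggest non-convergence.
geweke.diag(chain)
```

The same functions can be applied directly to the model output, for example geweke.diag(model2$samples$rho).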
One element of interest from fitting this model is the effect of the covariates on disease risk, which is typically presented as relative risks. For example, estimated relative risks and 95% credible intervals for a 1 unit increase in each covariate can be obtained via the code:
exp(model2$summary.results[2:4 , 1:3])
Median 2.5% 97.5%
pm10 1.023778 1.0088388 1.0403945
smoke 1.007327 1.0052135 1.0095453
ethnic 0.995311 0.9934217 0.9972039
So, for example, a 1 unit increase in PM\(_{10}\) concentrations is associated with a 2.4% increase in disease risk. But why does PM\(_{10}\) have a larger estimated effect than smoking? This seems counter-intuitive.
The reason is that the amount of variation in these variables differs, for example:
summary(dat$pm10)
Min. 1st Qu. Median Mean 3rd Qu. Max.
13.10 15.97 17.30 17.14 18.43 20.50
summary(dat$smoke)
Min. 1st Qu. Median Mean 3rd Qu. Max.
10.50 22.55 29.60 29.67 36.40 52.40
So a 1 unit increase in PM\(_{10}\) covers more of the variation in that variable than a 1 unit increase in smoking. Computing the relative risks for a standard deviation increase in each one gives:
exp(sd(dat$pm10) * model2$summary.results[2 , 1:3])
Median 2.5% 97.5%
1.042992 1.015888 1.073508
exp(sd(dat$smoke) * model2$summary.results[3 , 1:3])
Median 2.5% 97.5%
1.072884 1.051390 1.095874
So now smoking has the larger effect as expected.
References
Geweke, John. 1992. “Evaluating the Accuracy of Sampling-Based Approaches to the Calculation of Posterior Moments.” In Bayesian Statistics, 169–93. University Press.
Lee, D. 2013. “CARBayes: An R Package for Bayesian Spatial Modelling with Conditional Autoregressive Priors.” Journal of Statistical Software 55 (13): 1–24.
Leroux, Brian G., Xingye Lei, and Norman Breslow. 2000. “Estimation of Disease Rates in Small Areas: A New Mixed Model for Spatial Dependence.” In Statistical Models in Epidemiology, the Environment, and Clinical Trials, 179–91. New York: Springer-Verlag. http://dx.doi.org/10.1007/978-1-4612-1284-3_4.
Spiegelhalter, D, N Best, B Carlin, and A Van der Linde. 2002. “Bayesian Measures of Model Complexity and Fit.” Journal of the Royal Statistical Society B 64: 583–639.